Abstract

We present an extension of Principal Component Analysis (PCA) and a new algorithm for
clustering points in $\mathbb{R}^n$ based on it. The key property of the algorithm is that it is affine-invariant.
When the input is a sample from a mixture of two arbitrary Gaussians, the algorithm correctly 
classifies the sample assuming only that the two components are separable by a hyperplane, 
i.e., there exists a halfspace that contains most of one Gaussian and almost none of the other 
in probability mass. This is nearly the best possible, improving known results substantially 
[HI [9l [T]. For k > 2 components, the algorithm requires only that there be some (k — 1)- 
dimensional subspace in which the overlap in every direction is small. Here we define overlap to 
be the ratio of the following two quantities: 1) the average squared distance between a point and 
the mean of its component, and 2) the average squared distance between a point and the mean of 
the mixture. The main result may also be stated in the language of linear discriminant analysis: 
if the standard Fisher discriminant [5] is small enough, labels are not needed to estimate the 
optimal subspace for projection. Our main tools are isotropic transformation, spectral projection 
and a simple reweighting technique. We call this combination isotropic PCA. 
1 Introduction



We present an extension to Principal Component Analysis (PCA), which is able to go beyond
standard PCA in identifying "important" directions. When the covariance matrix of the input
(distribution or point set in $\mathbb{R}^n$) is a multiple of the identity, then PCA reveals no information; the
second moment along any direction is the same. Such inputs are called isotropic. Our extension,
which we call isotropic PCA, can reveal interesting information in such settings. We use this
technique to give an affine-invariant clustering algorithm for points in $\mathbb{R}^n$. When applied to the
problem of unraveling mixtures of arbitrary Gaussians from unlabeled samples, the algorithm yields
substantial improvements of known results.

To illustrate the technique, consider the uniform distribution on the set
$X = \{(x, y) \in \mathbb{R}^2 : x \in \{-1, 1\},\ y \in [-\sqrt{3}, \sqrt{3}]\}$, which is isotropic. Suppose this distribution is rotated in an unknown
way and that we would like to recover the original x and y axes. For each point in a sample, we
may project it to the unit circle and compute the covariance matrix of the resulting point set. The
x direction will correspond to the eigenvector with the larger eigenvalue, the y direction to the other. See Figure 1 for
an illustration. Instead of projection onto the unit circle, this process may also be thought of as
importance weighting, a technique which allows one to simulate one distribution with another. In
this case, we are simulating a distribution over the set X, where the density function is proportional
to $(1 + y^2)^{-1}$, so that points near $(1, 0)$ or $(-1, 0)$ are more probable.
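The two-line example above can be reproduced numerically. The following is a minimal sketch, assuming NumPy; the sample size, the rotation angle `theta`, and all variable names are illustrative choices that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20_000

# Sample uniformly from X = {(x, y) : x in {-1, 1}, y in [-sqrt(3), sqrt(3)]}.
x = rng.choice([-1.0, 1.0], size=m)
y = rng.uniform(-np.sqrt(3), np.sqrt(3), size=m)
pts = np.column_stack([x, y])

# Rotate by an angle that the recovery step below does not use.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = pts @ R.T

# Standard PCA sees almost nothing: the second moment matrix is close to I.
cov_plain = rotated.T @ rotated / m

# Project each point to the unit circle and take the second moment matrix.
on_circle = rotated / np.linalg.norm(rotated, axis=1, keepdims=True)
cov_circle = on_circle.T @ on_circle / m

evals, evecs = np.linalg.eigh(cov_circle)   # eigenvalues in ascending order
x_axis_estimate = evecs[:, -1]              # eigenvector of the larger eigenvalue
true_x_axis = R[:, 0]                       # image of (1, 0) under the rotation
alignment = abs(x_axis_estimate @ true_x_axis)
```

Here `alignment` is the absolute cosine between the recovered direction and the rotated x axis; a value near 1 indicates that the projection step has exposed the orientation that plain PCA could not see.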
Figure 1: Mapping points to the unit circle and then finding the direction of maximum variance
reveals the orientation of this isotropic distribution.

In this paper, we describe how to apply this method to mixtures of arbitrary Gaussians in $\mathbb{R}^n$
in order to find a set of directions along which the Gaussians are well-separated. These directions 
span the Fisher subspace of the mixture, a classical concept in Pattern Recognition. Once these 
directions are identified, points can be classified according to which component of the distribution 
generated them, and hence all parameters of the mixture can be learned. 

What separates this paper from previous work on learning mixtures is that our algorithm is 
affine-invariant. Indeed, for every mixture distribution that can be learned using a previously 
known algorithm, there is a linear transformation of bounded condition number that causes the 
algorithm to fail. For k = 2 components our algorithm has nearly the best possible guarantees (and 
subsumes all previous results) for clustering Gaussian mixtures. For $k > 2$, it requires that there
be a $(k-1)$-dimensional subspace where the overlap of the components is small in every direction
(see Section 1.2). This condition can be stated in terms of the Fisher discriminant, a quantity
commonly used in the field of Pattern Recognition with labeled data. Because our algorithm is
affine-invariant, it makes it possible to unravel a much larger set of Gaussian mixtures than had
been possible previously.

The first step of our algorithm is to place the mixture in isotropic position (see Section 1.2) via
an affine transformation. This has the effect of making the $(k-1)$-dimensional Fisher subspace, i.e.,
the one that minimizes the Fisher discriminant, the same as the subspace spanned by the means of
the components (they only coincide in general in isotropic position), for any mixture. The rest of
the algorithm identifies directions close to this subspace and uses them to cluster, without access to 
labels. Intuitively this is hard since after isotropy, standard PCA reveals no additional information. 
Before presenting the ideas and guarantees in more detail, we describe relevant related work. 

1.1 Previous Work 

A mixture model is a convex combination of distributions of known type. In the most commonly
studied version, a distribution F in $\mathbb{R}^n$ is composed of k unknown Gaussians. That is,

$$F = w_1 N(\mu_1, \Sigma_1) + \ldots + w_k N(\mu_k, \Sigma_k),$$

where the mixing weights $w_i$, means $\mu_i$, and covariance matrices $\Sigma_i$ are all unknown. Typically,
$k \ll n$, so that a concise model explains a high dimensional phenomenon. A random sample is
generated from F by first choosing a component with probability equal to its mixing weight and
then picking a random point from that component distribution. In this paper, we study the classical
problem of unraveling a sample from a mixture, i.e., labeling each point in the sample according
to its component of origin.
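The two-stage sampling process just described can be sketched as follows. This is an illustration only, assuming NumPy; the helper name `sample_mixture` and the example parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_mixture(weights, means, covs, m, rng):
    """Draw m points from sum_i w_i N(mu_i, Sigma_i); also return the hidden labels."""
    weights = np.asarray(weights, dtype=float)
    # Stage 1: pick a component for each point with probability equal to its weight.
    labels = rng.choice(len(weights), size=m, p=weights)
    points = np.empty((m, len(means[0])))
    # Stage 2: draw each point from the Gaussian of its chosen component.
    for i in range(len(weights)):
        idx = np.flatnonzero(labels == i)
        points[idx] = rng.multivariate_normal(means[i], covs[i], size=len(idx))
    return points, labels

rng = np.random.default_rng(1)
weights = [0.7, 0.3]
means = [np.zeros(3), np.array([4.0, 0.0, 0.0])]
covs = [np.eye(3), np.diag([0.5, 2.0, 2.0])]
points, labels = sample_mixture(weights, means, covs, 10_000, rng)
```

Unraveling the sample means recovering `labels` (up to a renaming of the components) from `points` alone.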

Heuristics for classifying samples include "expectation maximization" [5] and "k-means clustering".
These methods can take a long time and can get stuck with suboptimal classifications.
Over the past decade, there has been much progress on finding polynomial-time algorithms with 
rigorous guarantees for classifying mixtures, especially mixtures of Gaussians [H H5j HU H7J EJ [JJ. 
Starting with Dasgupta's paper [3J, one line of work uses the concentration of pairwise distances 
and assumes that the components' means are so far apart that distances between points from the 
same component are likely to be smaller than distances from points in different components. Arora 
and Kannan [14] establish nearly optimal results for such distance-based algorithms. Unfortunately 
their results inherently require separation that grows with the dimension of the ambient space and 
the largest variance of each component Gaussian. 

To see why this is unnatural, consider k well-separated Gaussians in $\mathbb{R}^k$ with means $e_1, \ldots, e_k$,
i.e. each mean is 1 unit away from the origin along a unique coordinate axis. Adding extra
dimensions with arbitrary variance does not affect the separability of these Gaussians, but these
algorithms are no longer guaranteed to work. For example, suppose that each Gaussian has a
maximum variance of $\epsilon \ll 1$. Then, adding $O^*(k\epsilon^{-2})$ extra dimensions with variance $\epsilon$ will violate
the necessary separation conditions.

Figure 2: Previous work requires distance concentration separability, which depends on the maximum
directional variance (a). Our results require only hyperplane separability, which depends
only on the variance in the separating direction (b). For non-isotropic mixtures the best separating
direction may not be between the means of the components (c). Panels: (a) Distance Concentration
Separability; (b) Hyperplane Separability; (c) Intermean Hyperplane and Fisher Hyperplane.

To improve on this, a subsequent line of work uses spectral projection (PCA). Vempala and
Wang [17] showed that for a mixture of spherical Gaussians, the subspace spanned by the top k
principal components of the mixture contains the means of the components. Thus, projecting to
this subspace has the effect of shrinking the components while maintaining the separation between
their means. This leads to a nearly optimal separation requirement of

$$\|\mu_i - \mu_j\| \ge \tilde{\Omega}(k^{1/4}) \max\{\sigma_i, \sigma_j\}$$

where $\mu_i$ is the mean of component i and $\sigma_i^2$ is the variance of component i along any direction.
Note that there is no dependence on the dimension of the distribution. Kannan et al. [9] applied the
spectral approach to arbitrary mixtures of Gaussians (and more generally, logconcave distributions)
and obtained a separation that grows with a polynomial in k and the largest variance of each
component:

$$\|\mu_i - \mu_j\| \ge \mathrm{poly}(k) \max\{\sigma_{i,\max}, \sigma_{j,\max}\}$$

where $\sigma_{i,\max}^2$ is the maximum variance of the ith component in any direction. The polynomial in
k was improved in pQ along with matching lower bounds for this approach, suggesting this to be 
the limit of spectral methods. Going beyond this "spectral threshold" for arbitrary Gaussians has 
been a major open problem. 

The representative hard case is the special case of two parallel "pancakes", i.e., two Gaussians 
that are spherical in n—1 directions and narrow in the last direction, so that a hyperplane orthogonal 
to the last direction separates the two. The spectral approach requires a separation that grows with 
their largest standard deviation which is unrelated to the distance between the pancakes (their 
means). Other examples can be generated by starting with Gaussians in k dimensions that are 
separable and then adding other dimensions, one of which has large variance. Because there is a 
subspace where the Gaussians are separable, the separation requirement should depend only on the 
dimension of this subspace and the components' variances in it. 

A related line of work considers learning symmetric product distributions, where the coordinates 
are independent. Feldman et al [6] have shown that mixtures of axis-aligned Gaussians can be 
approximated without any separation assumption at all in time exponential in k. A. Dasgupta 
et al [3] consider heavy-tailed distributions as opposed to Gaussians or log-concave ones and give 
conditions under which they can be clustered using an algorithm that is exponential in the number 
of samples. Chaudhuri and Rao [2] have recently given a polynomial time algorithm for clustering 
such heavy tailed product distributions. 

1.2 Results 

We assume we are given a lower bound w on the minimum mixing weight and k, the number 
of components. With high probability, our algorithm Unravel returns a partition of space by 
hyperplanes so that each part (a polyhedron) encloses almost all of the probability mass of a single 
component and almost none of the other components. The error of such a set of polyhedra is the 
total probability mass that falls outside the correct polyhedron. 

We first state our result for two Gaussians in a way that makes clear the relationship to previous 
work that relies on separation. 



Theorem 1. Let $w_1, \mu_1, \Sigma_1$ and $w_2, \mu_2, \Sigma_2$ define a mixture of two Gaussians. There is an absolute
constant C such that, if there exists a direction v such that

$$|\mathrm{proj}_v(\mu_1 - \mu_2)| \ge C \left(\sqrt{v^T \Sigma_1 v} + \sqrt{v^T \Sigma_2 v}\right) w^{-2} \log^{1/2}\left(\frac{1}{w\delta} + \frac{1}{\eta}\right),$$

then with probability $1 - \delta$ algorithm Unravel returns two complementary halfspaces that have
error at most $\eta$ using time and a number of samples that is polynomial in $n, w^{-1}, \log(1/\delta)$.

So the separation required between the means is comparable to the standard deviation in
some direction. This separation condition of Theorem 1 is affine-invariant and much weaker than
conditions of the form $\|\mu_1 - \mu_2\| \ge \max\{\sigma_{1,\max}, \sigma_{2,\max}\}$ used in previous work. See Figure 2. The
dotted line shows how previous work effectively treats every component as spherical. We also note
that the separating direction does not need to be the intermean direction, as illustrated in Figure
2(c). The dotted line illustrates the hyperplane induced by the intermean direction, which may be far
from the optimal separating hyperplane shown by the solid line.

It will be insightful to state this result in terms of the Fisher discriminant, a standard notion 
from Pattern Recognition [HI [7] that is used with labeled data. In words, the Fisher discriminant 
along direction p is

$$J(p) = \frac{\text{the intra-component variance in direction } p}{\text{the total variance in direction } p}.$$

Mathematically, this is expressed as

$$J(p) = \frac{E\left[\|\mathrm{proj}_p(x - \mu_{\ell(x)})\|^2\right]}{E\left[\|\mathrm{proj}_p(x)\|^2\right]} = \frac{p^T (w_1 \Sigma_1 + w_2 \Sigma_2) p}{p^T \left(w_1(\Sigma_1 + \mu_1 \mu_1^T) + w_2(\Sigma_2 + \mu_2 \mu_2^T)\right) p}$$

for x distributed according to a mixture distribution with means $\mu_i$ and covariance matrices $\Sigma_i$.
We use $\ell(x)$ to indicate the component from which x was drawn.
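As a small worked instance of the ratio above, the sketch below evaluates $J(p)$ from the mixture parameters. It assumes NumPy and a mixture whose mean is at the origin (so that the denominator is the total variance); the function name `fisher_discriminant` and the example mixture are illustrative, not part of the paper.

```python
import numpy as np

def fisher_discriminant(p, weights, means, covs):
    """J(p) = p^T (sum_i w_i Sigma_i) p / p^T (sum_i w_i (Sigma_i + mu_i mu_i^T)) p."""
    p = np.asarray(p, dtype=float)
    intra = sum(w * S for w, S in zip(weights, covs))
    total = sum(w * (S + np.outer(mu, mu)) for w, mu, S in zip(weights, means, covs))
    return (p @ intra @ p) / (p @ total @ p)

# Two unit-covariance components at (-2, 0) and (2, 0) with equal weights.
weights = [0.5, 0.5]
means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), np.eye(2)]

J_x = fisher_discriminant([1.0, 0.0], weights, means, covs)   # 1 / (1 + 4)
J_y = fisher_discriminant([0.0, 1.0], weights, means, covs)   # 1 / 1
```

Along the intermean direction the discriminant is small because most of the variance comes from the displacement of the means; orthogonal to it, all of the variance is intra-component and the discriminant equals 1.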

Theorem 2. There is an absolute constant C for which the following holds. Suppose that F is a
mixture of two Gaussians such that there exists a direction p for which

$$J(p) \le C w^3 \log^{-1}\left(\frac{1}{\delta w} + \frac{1}{\eta}\right).$$

With probability $1 - \delta$, algorithm Unravel returns a halfspace with error at most $\eta$ using time and
sample complexity polynomial in $n, w^{-1}, \log(1/\delta)$.

There are several ways of generalizing the Fisher discriminant for $k = 2$ components to greater
k [7]. These generalizations are most easily understood when the distribution is isotropic. An
isotropic distribution has the identity matrix as its covariance and the origin as its mean. An
isotropic mixture therefore has

$$\sum_{i=1}^{k} w_i \mu_i = 0 \quad \text{and} \quad \sum_{i=1}^{k} w_i (\Sigma_i + \mu_i \mu_i^T) = I.$$

It is well known that any distribution with bounded covariance matrix (and therefore any mixture)
can be made isotropic by an affine transformation. As we will see shortly, for $k = 2$, for an isotropic
mixture, the line joining the means is the direction that minimizes the Fisher discriminant.
Under isotropy, the denominator of the Fisher discriminant is always 1. Thus, the discriminant
is just the expected squared distance between the projection of a point and the projection of its
mean, where projection is onto some direction p. The generalization to $k > 2$ is natural, as we may
simply replace projection onto direction p with projection onto a $(k-1)$-dimensional subspace S.
For convenience, let

$$\Sigma = \sum_{i=1}^{k} w_i \Sigma_i.$$

Let the vectors $p_1, \ldots, p_{k-1}$ be an orthonormal basis of S and let $\ell(x)$ be the component from which
x was drawn. We then have under isotropy

$$J(S) = E\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right] = \sum_{j=1}^{k-1} p_j^T \Sigma p_j$$

for x distributed according to a mixture distribution with means $\mu_i$ and covariance matrices $\Sigma_i$.
As $\Sigma$ is symmetric positive definite, it follows that the smallest $k-1$ eigenvectors of the matrix are
optimal choices of $p_j$ and S is the span of these eigenvectors.

This motivates our definition of the Fisher subspace for any mixture with bounded second 
moments (not necessarily Gaussians). 

Definition 1. Let $\{w_i, \mu_i, \Sigma_i\}$ be the weights, means, and covariance matrices for an isotropic¹
mixture distribution with mean at the origin and where $\dim(\mathrm{span}\{\mu_1, \ldots, \mu_k\}) = k-1$. Let $\ell(x)$ be
the component from which x was drawn. The Fisher subspace F is defined as the $(k-1)$-dimensional
subspace that minimizes

$$J(S) = E\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right]$$

over subspaces S of dimension $k-1$.

Note that $\dim(\mathrm{span}\{\mu_1, \ldots, \mu_k\})$ is only $k-1$ because isotropy implies $\sum_{i=1}^{k} w_i \mu_i = 0$. The
next lemma provides a simple alternative characterization of the Fisher subspace as the span of the
means of the components (after transforming to isotropic position). The proof is given in Section
3.2.

Lemma 1. Suppose $\{w_i, \mu_i, \Sigma_i\}_{i=1}^{k}$ defines an isotropic mixture in $\mathbb{R}^n$. Let $\lambda_1 \ge \ldots \ge \lambda_n$ be the
eigenvalues of the matrix $\Sigma = \sum_{i=1}^{k} w_i \Sigma_i$ and let $v_1, \ldots, v_n$ be the corresponding eigenvectors. If
the dimension of the span of the means of the components is $k-1$, then the Fisher subspace

$$F = \mathrm{span}\{v_{n-k+2}, \ldots, v_n\} = \mathrm{span}\{\mu_1, \ldots, \mu_k\}.$$
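Lemma 1 can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions: it uses NumPy, works with population parameters rather than samples, and the randomly generated mixture (dimension 4, three components) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 3
weights = np.array([0.5, 0.3, 0.2])
means = rng.standard_normal((k, n)) * 3.0
covs = []
for _ in range(k):
    G = rng.standard_normal((n, n))
    covs.append(G @ G.T + 0.1 * np.eye(n))      # random positive definite covariance

# Population mean and covariance of the mixture.
mix_mean = weights @ means
mix_cov = sum(w * (S + np.outer(mu - mix_mean, mu - mix_mean))
              for w, mu, S in zip(weights, means, covs))

# Isotropic position: x -> T (x - mix_mean) with T = mix_cov^{-1/2} (symmetric).
evals, evecs = np.linalg.eigh(mix_cov)
T = evecs @ np.diag(evals ** -0.5) @ evecs.T
iso_means = (means - mix_mean) @ T.T
iso_covs = [T @ S @ T.T for S in covs]

# Sigma = sum_i w_i Sigma_i; eigh returns eigenvalues in ascending order.
Sigma = sum(w * S for w, S in zip(weights, iso_covs))
lam, V = np.linalg.eigh(Sigma)
fisher_basis = V[:, :k - 1]                      # the k - 1 smallest eigenvectors

# Each isotropic mean should lie in the span of fisher_basis.
residual = iso_means - iso_means @ fisher_basis @ fisher_basis.T
max_residual = np.abs(residual).max()
```

In this instance the remaining $n-k+1$ eigenvalues of `Sigma` equal 1, which matches the identity $\Sigma = I - \sum_i w_i \mu_i \mu_i^T$ that holds under isotropy.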

Our algorithm attempts to find the Fisher subspace (or one close to it) and succeeds in doing 
so, provided the discriminant is small enough. 

The next definition will be useful in stating our main theorem precisely. 

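Definition 2. The overlap of a mixture given as in Definition 1 is

$$\phi = \min_{S : \dim(S) = k-1} \ \max_{p \in S,\ \|p\| = 1} p^T \Sigma p. \qquad (1)$$

¹ For non-isotropic mixtures, the Fisher discriminant generalizes to $\sum_{j=1}^{k-1} p_j^T \left(\sum_{i=1}^{k} w_i(\Sigma_i + \mu_i \mu_i^T)\right)^{-1} \Sigma\, p_j$ and the
overlap to $p^T \left(\sum_{i=1}^{k} w_i(\Sigma_i + \mu_i \mu_i^T)\right)^{-1} \Sigma\, p$.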
It is a direct consequence of the Courant-Fischer min-max theorem that $\phi$ is the $(k-1)$th smallest
eigenvalue of the matrix $\Sigma$ and the subspace achieving $\phi$ is the Fisher subspace.



We can now state our main theorem for k > 2. 

Theorem 3. There is an absolute constant C for which the following holds. Suppose that F is a
mixture of k Gaussian components where the overlap satisfies 



With probability $1 - \delta$, algorithm Unravel returns a set of k polyhedra that have error at most $\eta$
using time and a number of samples that is polynomial in $n, w^{-1}, \log(1/\delta)$.

In words, the algorithm successfully unravels arbitrary Gaussians provided there exists a $(k-1)$-dimensional
subspace in which along every direction, the expected squared distance of a point to
its component mean is smaller than the expected squared distance to the overall mean by roughly a
$\mathrm{poly}(k, 1/w)$ factor. There is no dependence on the largest variances of the individual components,
and the dependence on the ambient dimension is logarithmic. This means that the addition of
extra dimensions (even where the distribution has large variance) as discussed in Section 1.1 has
little impact on the success of our algorithm.

2 Algorithm 

The algorithm has three major components: an initial affine transformation, a reweighting step, 
and identification of a direction close to the Fisher subspace and a hyperplane orthogonal to this 
direction which leaves each component's probability mass almost entirely in one of the halfspaces 
induced by the hyperplane. The key insight is that the reweighting technique will either cause the 
mean of the mixture to shift in the intermean subspace, or cause the top k — 1 principal components 
of the second moment matrix to approximate the intermean subspace. In either case, we obtain a 
direction along which we can partition the components. 

We first find an affine transformation W which when applied to F results in an isotropic
distribution. That is, we move the mean to the origin and apply a linear transformation to make
the covariance matrix the identity. We apply this transformation to a new set of $m_1$ points $\{x_i\}$
from F and then reweight according to a spherically symmetric Gaussian $\exp(-\|x\|^2/(2a))$ for
$a = O(n/w)$. We then compute the mean $\hat{\mu}$ and second moment matrix M of the resulting set.²

After the reweighting, the algorithm chooses either the new mean or the direction of maximum
second moment and projects the data onto this direction h. By bisecting the largest gap between
points, we obtain a threshold t, which along with h defines a hyperplane that separates the
components. Using the notation $H_{h,t} = \{x \in \mathbb{R}^n : h^T x \ge t\}$ to indicate a halfspace, we then recurse
on each half of the mixture. Thus, every node in the recursion tree represents an intersection of
halfspaces. To make our analysis easier, we assume that we use different samples for each step
of the algorithm. The reader might find it useful to read Section 2.1, which gives an intuitive
explanation for how the algorithm works on parallel pancakes, before reviewing the details of the
algorithm.

2 This practice of transforming the points and then looking at the second moment matrix can be viewed as a form 
of kernel PCA; however the connection between our algorithm and kernel PCA is superficial. Our transformation 
does not result in any standard kernel. Moreover, it is dimension-preserving (it is just a reweighting), and hence the 
"kernel trick" has no computational advantage. 
Algorithm 1 Unravel

Input: Integer k, scalar w. Initialization: $P = \mathbb{R}^n$.

1. (Isotropy) Use samples lying in P to compute an affine transformation W that makes the
distribution nearly isotropic (mean zero, identity covariance matrix).

2. (Reweighting) Use $m_1$ samples in P and for each compute a weight $e^{-\|x\|^2/(2a)}$ (where $a \ge n/w$).

3. (Separating Direction) Find the mean of the reweighted data $\hat{\mu}$. If $\|\hat{\mu}\| > \sqrt{w}/(32a)$, let
$h = \hat{\mu}$. Otherwise, find the covariance matrix M of the reweighted points and let h be its top
principal component.

4. (Recursion) Project $m_2$ sample points to h and find the largest gap between points in the
interval $[-1/2, 1/2]$. If this gap is less than $1/(4(k-1))$, then return P. Otherwise, set t to be
the midpoint of the largest gap, recurse on $P \cap H_{h,t}$ and $P \cap H_{-h,-t}$, and return the union of
the polyhedra produced by these recursive calls.
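One level of the four steps above can be sketched in NumPy as follows. This is a simplified illustration, not the analyzed procedure: it reuses one sample for every step (the analysis assumes fresh samples), it does not recurse, it uses the reweighted second moment matrix for M, and it treats the endpoints of $[-1/2, 1/2]$ as gap boundaries, which is an interpretive assumption. The function names and the test mixture are hypothetical.

```python
import numpy as np

def make_isotropic(X):
    """Step 1: affine map giving sample mean 0 and sample covariance I."""
    centered = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return centered @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

def separating_direction(Z, w, a):
    """Steps 2-3: Gaussian reweighting, then mean shift or top principal component."""
    weights = np.exp(-np.sum(Z ** 2, axis=1) / (2 * a))
    mu_hat = (weights[:, None] * Z).mean(axis=0)
    if np.linalg.norm(mu_hat) > np.sqrt(w) / (32 * a):
        return mu_hat / np.linalg.norm(mu_hat)
    M = (Z * weights[:, None]).T @ Z / len(Z)     # reweighted second moment matrix
    return np.linalg.eigh(M)[1][:, -1]            # eigenvector of the largest eigenvalue

def gap_threshold(proj, k):
    """Step 4, one level: midpoint of the largest empty stretch of [-1/2, 1/2].

    Assumption: the interval endpoints count as gap boundaries.
    Returns None when the largest gap is below 1 / (4 (k - 1)).
    """
    inside = np.sort(proj[(proj >= -0.5) & (proj <= 0.5)])
    pts = np.concatenate(([-0.5], inside, [0.5]))
    gaps = np.diff(pts)
    i = int(np.argmax(gaps))
    if gaps[i] < 1.0 / (4 * (k - 1)):
        return None
    return 0.5 * (pts[i] + pts[i + 1])

# Hypothetical test mixture: two parallel "pancakes" with weights 0.8 and 0.2,
# distorted by an arbitrary affine map.
rng = np.random.default_rng(0)
n, m, w = 3, 500_000, 0.2
light = rng.random(m) < w                  # hidden labels, used only for evaluation
Y = rng.standard_normal((m, n))
Y[:, 0] += 50.0 * light
A = rng.standard_normal((n, n))
X = Y @ A.T + rng.standard_normal(n)

Z = make_isotropic(X)
h = separating_direction(Z, w, a=n / w)
proj = Z @ h
t = gap_threshold(proj, k=2)
predicted_light = proj < t
agreement = np.mean(predicted_light == light)
accuracy = max(agreement, 1.0 - agreement)    # labels are defined only up to a swap
```

Because the weights are imbalanced here, the reweighted mean shifts noticeably and the first branch of step 3 is taken; the equal-weight case instead falls through to the second moment matrix.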



2.1 Parallel Pancakes 

The following special case, which represents the open problem in previous work, will illuminate the 
intuition behind the new algorithm. Suppose F is a mixture of two spherical Gaussians that are
well-separated, i.e. the intermean distance is large compared to the standard deviation along any 
direction. We consider two cases, one where the mixing weights are equal and another where they 
are imbalanced. 

After isotropy is enforced, each component will become thin in the intermean direction, giving 
the density the appearance of two parallel pancakes. When the mixing weights are equal, the means 
of the components will be equally spaced at a distance of $\sqrt{1 - \phi}$ on opposite sides of the origin. For
imbalanced weights, the origin will still lie on the intermean direction but will be much closer to 
the heavier component, while the lighter component will be much further away. In both cases, this 
transformation makes the variance of the mixture 1 in every direction, so the principal components 
give us no insight into the inter-mean direction. 

Consider next the effect of the reweighting on the mean of the mixture. For the case of equal 
mixing weights, symmetry assures that the mean does not shift at all. For imbalanced weights, 
however, the heavier component, which lies closer to the origin will become heavier still. Thus, 
the reweighted mean shifts toward the mean of the heavier component, allowing us to detect the 
intermean direction. 

Finally, consider the effect of reweighting on the second moments of the mixture with equal 
mixing weights. Because points closer to the origin are weighted more, the second moment in every 
direction is reduced. However, in the intermean direction, where part of the moment is due to 
the displacement of the component means from the origin, it shrinks less. Thus, the direction of 
maximum second moment is the intermean direction. 
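The equal-weight case can be simulated directly in isotropic coordinates. The sketch below assumes NumPy; the dimension, the overlap value `phi`, and the sample size are illustrative, and the component means are placed at distance $\sqrt{1-\phi}$ from the origin so that the variance along the intermean direction $e_1$ is exactly 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, phi = 4, 200_000, 0.01
a = n / 0.5                                   # a = n / w with w = 1/2

# Equal-weight parallel pancakes, already in isotropic position:
# means at +/- sqrt(1 - phi) e_1, variance phi along e_1 and 1 elsewhere.
signs = rng.choice([-1.0, 1.0], size=m)
Z = rng.standard_normal((m, n))
Z[:, 0] = signs * np.sqrt(1 - phi) + np.sqrt(phi) * Z[:, 0]

plain_second_moment = Z.T @ Z / m             # close to the identity

weights = np.exp(-np.sum(Z ** 2, axis=1) / (2 * a))
reweighted_mean = (weights[:, None] * Z).mean(axis=0)
M = (Z * weights[:, None]).T @ Z / m          # reweighted second moment matrix

evals, evecs = np.linalg.eigh(M)
top_direction = evecs[:, -1]
alignment = abs(top_direction[0])             # |cosine| with the intermean direction e_1
```

In this simulation the reweighted mean stays near the origin, as symmetry predicts, while the top eigenvector of `M` lines up with $e_1$ even though the unweighted second moment matrix is nearly the identity.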

2.2 Overview of Analysis 

To analyze the algorithm in the general case, we will proceed as follows. Section 3 shows that under
isotropy the Fisher subspace coincides with the intermean subspace (Lemma 1), gives the necessary
sampling convergence and perturbation lemmas and relates overlap to a more conventional notion
of separation (Prop. 5). Section 3.3 gives approximations to the first and second moments. Section
4 then combines these approximations with the perturbation lemmas to show that the vector h
(either the mean shift or the largest principal component) lies close to the intermean subspace.
Finally, Section 5 shows the correctness of the recursive aspects of the algorithm.



3 Preliminaries 

3.1 Matrix Properties 

For a matrix Z, we will denote the ith largest eigenvalue of Z by $\lambda_i(Z)$ or just $\lambda_i$ if the matrix is
clear from context. Unless specified otherwise, all norms are the 2-norm. For symmetric matrices,
this is $\|Z\|_2 = \lambda_1(Z) = \max_{x \in \mathbb{R}^n} \|Zx\|_2/\|x\|_2$.

The following two facts from linear algebra will be useful in our analysis. 

Fact 2. Let $\lambda_1 \ge \ldots \ge \lambda_n$ be the eigenvalues for an n-by-n symmetric positive definite matrix Z
and let $v_1, \ldots, v_n$ be the corresponding eigenvectors. Then

$$\lambda_n + \ldots + \lambda_{n-k+1} = \min_{S : \dim(S) = k} \sum_{j=1}^{k} p_j^T Z p_j,$$

where $\{p_j\}$ is any orthonormal basis for S. If $\lambda_{n-k} > \lambda_{n-k+1}$, then $\mathrm{span}\{v_n, \ldots, v_{n-k+1}\}$ is the
unique minimizing subspace.

Recall that a matrix Z is positive semi-definite if $x^T Z x \ge 0$ for all non-zero x.

Fact 3. Suppose that the matrix

$$Z = \begin{bmatrix} A & B^T \\ B & D \end{bmatrix}$$

is symmetric positive semi-definite and that A and D are square submatrices. Then $\|B\| \le \sqrt{\|A\| \|D\|}$.

Proof. Let y and x be the top left and right singular vectors of B, so that $y^T B x = \|B\|$. Because
Z is positive semi-definite, we have that for any real $\gamma$,

$$0 \le [\gamma x^T \ y^T] \, Z \, [\gamma x^T \ y^T]^T = \gamma^2 x^T A x + 2\gamma\, y^T B x + y^T D y.$$

This is a quadratic polynomial in $\gamma$ that can have at most one real root. Therefore the discriminant
must be non-positive:

$$0 \ge 4 (y^T B x)^2 - 4 (x^T A x)(y^T D y).$$

We conclude that

$$\|B\| = y^T B x \le \sqrt{(x^T A x)(y^T D y)} \le \sqrt{\|A\| \|D\|}.$$

□ 
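The bound in Fact 3 is easy to check on a random instance. The following sketch assumes NumPy; the matrix size and the block split `r` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.standard_normal((6, 6))
Z = G @ G.T                                   # symmetric positive semi-definite

r = 2                                         # size of the top-left block A
A, B, D = Z[:r, :r], Z[r:, :r], Z[r:, r:]     # Z = [[A, B^T], [B, D]]

def spectral_norm(M):
    return np.linalg.norm(M, 2)               # largest singular value

lhs = spectral_norm(B)
rhs = np.sqrt(spectral_norm(A) * spectral_norm(D))
```

Here `lhs` should never exceed `rhs`, whatever seed or block split is used.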



3.2 The Fisher Criterion and Isotropy 

We begin with the proof of the lemma that for an isotropic mixture the Fisher subspace is the same 
as the intermean subspace. 
Proof of Lemma 1. By definition for an isotropic distribution, the Fisher subspace minimizes

$$J(S) = E\left[\|\mathrm{proj}_S(x - \mu_{\ell(x)})\|^2\right] = \sum_{j=1}^{k-1} p_j^T \Sigma p_j,$$

where $\{p_j\}$ is an orthonormal basis for S.

By Fact 2 one minimizing subspace is the span of the smallest $k-1$ eigenvectors of the matrix
$\Sigma$, i.e. $v_{n-k+2}, \ldots, v_n$. Because the distribution is isotropic,

$$\Sigma = I - \sum_{i=1}^{k} w_i \mu_i \mu_i^T,$$

and these vectors become the largest eigenvectors of $\sum_{i=1}^{k} w_i \mu_i \mu_i^T$. Clearly, $\mathrm{span}\{v_{n-k+2}, \ldots, v_n\} \subseteq
\mathrm{span}\{\mu_1, \ldots, \mu_k\}$, but both spans have dimension $k-1$, making them equal.

Since $v_{n-k+1}$ must be orthogonal to the other eigenvectors, it follows that $\lambda_{n-k+1} = 1 > \lambda_{n-k+2}$.
Therefore, $\mathrm{span}\{v_{n-k+2}, \ldots, v_n\} \subseteq \mathrm{span}\{\mu_1, \ldots, \mu_k\}$ is the unique minimizing subspace. □

It follows directly that under the conditions of Lemma 1 the overlap may be characterized as

$$\phi = \lambda_{n-k+2}(\Sigma) = 1 - \lambda_{k-1}\left(\sum_{i=1}^{k} w_i \mu_i \mu_i^T\right).$$

For clarity of the analysis, we will assume that Step 1 of the algorithm produces a perfectly 
isotropic mixture. Theorem d] gives a bound on the required number of samples to make the 
distribution nearly isotropic, and as our analysis shows, our algorithm is robust to small estimation 
errors. 

We will also assume for convenience of notation that the unit vectors along the first $k-1$ coordinate
axes $e_1, \ldots, e_{k-1}$ span the intermean (i.e. Fisher) subspace. That is, $F = \mathrm{span}\{e_1, \ldots, e_{k-1}\}$.
When considering this subspace it will be convenient to be able to refer to projection of the mean
vectors to this subspace. Thus, we define $\tilde{\mu}_i \in \mathbb{R}^{k-1}$ to be the first $k-1$ coordinates of $\mu_i$; the
remaining coordinates are all zero. In other terms,

$$\mu_i = [\tilde{\mu}_i^T \ \ 0, \ldots, 0]^T.$$

In this coordinate system the covariance matrix of each component has a particular structure,
which will be useful for our analysis. For the rest of this paper we fix the following notation: an
isotropic mixture is defined by $\{w_i, \mu_i, \Sigma_i\}$. We assume that $\mathrm{span}\{e_1, \ldots, e_{k-1}\}$ is the intermean
subspace and $A_i$, $B_i$, and $D_i$ are defined such that

$$w_i \Sigma_i = \begin{bmatrix} A_i & B_i^T \\ B_i & D_i \end{bmatrix} \qquad (2)$$

where $A_i$ is a $(k-1) \times (k-1)$ submatrix and $D_i$ is a $(n-k+1) \times (n-k+1)$ submatrix.

Lemma 4 (Covariance Structure). Using the above notation,

$$\|A_i\| \le \phi, \quad \|D_i\| \le 1, \quad \|B_i\| \le \sqrt{\phi}$$

for all components i.
Proof of Lemma 4. Because $\mathrm{span}\{e_1, \ldots, e_{k-1}\}$ is the Fisher subspace,

$$\phi = \max_{v \in \mathbb{R}^{k-1}} \frac{1}{\|v\|^2} \sum_{i=1}^{k} v^T A_i v.$$

Also $\sum_{i=1}^{k} D_i = I$, so $\|\sum_{i=1}^{k} D_i\| = 1$. Each matrix $w_i \Sigma_i$ is positive definite, so the principal minors
$A_i, D_i$ must be positive definite as well. Therefore, $\|A_i\| \le \phi$, $\|D_i\| \le 1$, and $\|B_i\| \le \sqrt{\|A_i\| \|D_i\|} \le
\sqrt{\phi}$ using Fact 3. □

For small $\phi$, the covariance between intermean and non-intermean directions, i.e. $B_i$, is small.
For $k = 2$, this means that all densities will have a "nearly parallel pancake" shape. In general, it
means that $k-1$ of the principal axes of the Gaussians will lie close to the intermean subspace.

We conclude this section with a proposition connecting, for $k = 2$, the overlap to a standard
notion of separation between two distributions, so that Theorem 1 becomes an immediate corollary
of Theorem 2.

Proposition 5. If there exists a unit vector p such that

$$|p^T(\mu_1 - \mu_2)| \ge t\left(\sqrt{p^T w_1 \Sigma_1 p} + \sqrt{p^T w_2 \Sigma_2 p}\right),$$

then the overlap $\phi \le J(p) \le (1 + w_1 w_2 t^2)^{-1}$.

Proof of Proposition 5. Since the mean of the distribution is at the origin, we have $w_1 p^T \mu_1 =
-w_2 p^T \mu_2$. Thus,

$$|p^T \mu_1 - p^T \mu_2|^2 = (p^T \mu_1)^2 + (p^T \mu_2)^2 + 2 |p^T \mu_1| |p^T \mu_2| = (w_1 p^T \mu_1)^2 \left(\frac{1}{w_1^2} + \frac{1}{w_2^2} + \frac{2}{w_1 w_2}\right).$$

Using $w_1 + w_2 = 1$, we rewrite the last factor as

$$\frac{1}{w_1^2} + \frac{1}{w_2^2} + \frac{2}{w_1 w_2} = \frac{w_1^2 + w_2^2 + 2 w_1 w_2}{w_1^2 w_2^2} = \frac{1}{w_1^2 w_2^2} = \frac{1}{w_1 w_2}\left(\frac{1}{w_1} + \frac{1}{w_2}\right).$$

Again, using the fact that $w_1 p^T \mu_1 = -w_2 p^T \mu_2$, we have that

$$|p^T \mu_1 - p^T \mu_2|^2 = \frac{(w_1 p^T \mu_1)^2}{w_1 w_2}\left(\frac{1}{w_1} + \frac{1}{w_2}\right) = \frac{w_1 (p^T \mu_1)^2 + w_2 (p^T \mu_2)^2}{w_1 w_2}.$$

Thus, by the separation condition

$$w_1 (p^T \mu_1)^2 + w_2 (p^T \mu_2)^2 = w_1 w_2 |p^T \mu_1 - p^T \mu_2|^2 \ge w_1 w_2 t^2 (p^T w_1 \Sigma_1 p + p^T w_2 \Sigma_2 p).$$

To bound $J(p)$, we then argue

$$J(p) = \frac{p^T w_1 \Sigma_1 p + p^T w_2 \Sigma_2 p}{w_1 (p^T \Sigma_1 p + (p^T \mu_1)^2) + w_2 (p^T \Sigma_2 p + (p^T \mu_2)^2)}$$

$$= 1 - \frac{w_1 (p^T \mu_1)^2 + w_2 (p^T \mu_2)^2}{w_1 (p^T \Sigma_1 p + (p^T \mu_1)^2) + w_2 (p^T \Sigma_2 p + (p^T \mu_2)^2)}$$

$$\le 1 - \frac{w_1 w_2 t^2 (w_1 p^T \Sigma_1 p + w_2 p^T \Sigma_2 p)}{w_1 (p^T \Sigma_1 p + (p^T \mu_1)^2) + w_2 (p^T \Sigma_2 p + (p^T \mu_2)^2)}$$

$$= 1 - w_1 w_2 t^2 J(p),$$

and $J(p) \le 1/(1 + w_1 w_2 t^2)$. □
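The inequality of Proposition 5 can be spot-checked on random two-component mixtures whose mean is at the origin. This is a sketch assuming NumPy; the dimension, the sampling ranges, and the number of trials are arbitrary. In each trial `t` is set to the largest value for which the separation condition holds, so the resulting bound is the tightest one the proposition gives.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
worst_slack = np.inf
for _ in range(200):
    w1 = rng.uniform(0.05, 0.95)
    w2 = 1.0 - w1
    mu1 = rng.standard_normal(n) * 3.0
    mu2 = -w1 * mu1 / w2                      # puts the mixture mean at the origin
    G1, G2 = rng.standard_normal((2, n, n))
    S1, S2 = G1 @ G1.T, G2 @ G2.T             # random covariance matrices
    p = rng.standard_normal(n)
    p /= np.linalg.norm(p)

    intra = w1 * (p @ S1 @ p) + w2 * (p @ S2 @ p)
    between = w1 * (p @ mu1) ** 2 + w2 * (p @ mu2) ** 2
    J = intra / (intra + between)

    # Largest t for which the separation condition holds (with equality).
    t = abs(p @ (mu1 - mu2)) / (np.sqrt(w1 * (p @ S1 @ p)) + np.sqrt(w2 * (p @ S2 @ p)))
    bound = 1.0 / (1.0 + w1 * w2 * t ** 2)
    worst_slack = min(worst_slack, bound - J)
```

A non-negative `worst_slack` means that $J(p) \le (1 + w_1 w_2 t^2)^{-1}$ held in every trial.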
3.3 Approximation of the Reweighted Moments



Our algorithm works by computing the first and second reweighted moments of a point set from F.
In this section, we examine how the reweighting affects the second moments of a single component
and then give some approximations for the first and second moments of the entire mixture.



3.3.1 Single Component 

The first step is to characterize how the reweighting affects the moments of a single component.
Specifically, we will show for any function f (and therefore x and $xx^T$ in particular) that for $a > 0$,

$$E\left[f(x) \exp\left(-\frac{\|x\|^2}{2a}\right)\right] = \sum_i w_i \rho_i E_i[f(y_i)].$$

Here, $E_i[\cdot]$ denotes expectation taken with respect to the component i, the quantity $\rho_i = E_i\left[\exp\left(-\frac{\|x\|^2}{2a}\right)\right]$,
and $y_i$ is a Gaussian variable with parameters slightly perturbed from the original ith component.



Claim 6. If $a \ge n/w$, the quantity $\rho_i = E_i\left[\exp\left(-\frac{\|x\|^2}{2a}\right)\right]$ is at least 1/2.



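Proof. Because the distribution is isotropic, for any component i, $w_i E_i[\|x\|^2] \le n$. Therefore,

$$\rho_i = E_i\left[\exp\left(-\frac{\|x\|^2}{2a}\right)\right] \ge E_i\left[1 - \frac{\|x\|^2}{2a}\right] \ge 1 - \frac{1}{2a} \cdot \frac{n}{w_i} \ge \frac{1}{2}.$$

□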



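Lemma 7 (Reweighted Moments of a Single Component). For any $a > 0$, with respect to a single
component i of the mixture

$$E_i\left[x \exp\left(-\frac{\|x\|^2}{2a}\right)\right] = \rho_i\left(\mu_i - \frac{1}{a} \Sigma_i \mu_i + f\right)$$

and

$$E_i\left[x x^T \exp\left(-\frac{\|x\|^2}{2a}\right)\right] = \rho_i\left(\Sigma_i + \mu_i \mu_i^T - \frac{1}{a}\left(\Sigma_i \Sigma_i + \mu_i \mu_i^T \Sigma_i + \Sigma_i \mu_i \mu_i^T\right) + F\right)$$

where $\|f\|, \|F\| = O(a^{-2})$.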

We first establish the following claim. 

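Claim 8. Let x be a random variable distributed according to the normal distribution $N(\mu, \Sigma)$ and
let $\Sigma = Q \Lambda Q^T$ be the singular value decomposition of $\Sigma$ with $\lambda_1, \ldots, \lambda_n$ being the diagonal elements
of $\Lambda$. Let $W = \mathrm{diag}(a/(a + \lambda_1), \ldots, a/(a + \lambda_n))$. Finally, let y be a random variable distributed
according to $N(Q W Q^T \mu, Q W \Lambda Q^T)$. Then for any function $f(x)$,

$$E\left[f(x) \exp\left(-\frac{\|x\|^2}{2a}\right)\right] = \det(W)^{1/2} \exp\left(-\frac{\mu^T Q W Q^T \mu}{2a}\right) E[f(y)].$$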

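Proof of Claim 8. We assume that $Q = I$ for the initial part of the proof. From the definition of a
Gaussian distribution, we have

\[ E\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(\Lambda)^{-1/2}(2\pi)^{-n/2}\int_{\mathbb{R}^n} f(x)\exp\left(-\frac{x^Tx}{2\alpha} - \frac{(x-\mu)^T\Lambda^{-1}(x-\mu)}{2}\right)dx. \]

Because $\Lambda$ is diagonal, we may write the exponents on the right hand side (up to the common factor $-1/2$) as

\[ \sum_{i=1}^n x_i^2\alpha^{-1} + (x_i - \mu_i)^2\lambda_i^{-1} = \sum_{i=1}^n x_i^2(\lambda_i^{-1} + \alpha^{-1}) - 2x_i\mu_i\lambda_i^{-1} + \mu_i^2\lambda_i^{-1}. \]

Completing the square gives the expression

\[ \sum_{i=1}^n \left(x_i - \mu_i\frac{\alpha}{\alpha+\lambda_i}\right)^2\left(\lambda_i\frac{\alpha}{\alpha+\lambda_i}\right)^{-1} + \mu_i^2\lambda_i^{-1} - \mu_i^2\lambda_i^{-1}\frac{\alpha}{\alpha+\lambda_i}. \]

The last two terms can be simplified to $\mu_i^2/(\alpha + \lambda_i)$. In matrix form the exponent becomes

\[ (x - W\mu)^T(W\Lambda)^{-1}(x - W\mu) + \mu^TW\mu\,\alpha^{-1}. \]

For general $Q$, this becomes

\[ (x - QWQ^T\mu)^TQ(W\Lambda)^{-1}Q^T(x - QWQ^T\mu) + \mu^TQWQ^T\mu\,\alpha^{-1}. \]

Now recalling the definition of the random variable $y$, we see

\[ E\left[f(x)\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(\Lambda)^{-1/2}(2\pi)^{-n/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right)\int f(x)\exp\left(-\frac{1}{2}(x - QWQ^T\mu)^TQ(W\Lambda)^{-1}Q^T(x - QWQ^T\mu)\right)dx \]

\[ = \det(W)^{1/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right)E[f(y)]. \]

□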
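As a numerical sanity check of Claim 8 (not part of the original argument), the sketch below compares both sides of the identity in one dimension, where $W = \alpha/(\alpha + \lambda)$ is a scalar. The helper names `gaussian_pdf`, `simpson` and `claim8_sides`, the use of Simpson quadrature, and the truncation of the integrals at twelve standard deviations are all assumptions made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def simpson(g, lo, hi, steps=20000):
    """Composite Simpson rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = g(lo) + g(hi)
    for j in range(1, steps):
        total += (4.0 if j % 2 else 2.0) * g(lo + j * h)
    return total * h / 3.0

def claim8_sides(f, mu, lam, alpha):
    """Return (lhs, rhs) of Claim 8 for x ~ N(mu, lam) in one dimension."""
    W = alpha / (alpha + lam)
    sd_x = math.sqrt(lam)
    lhs = simpson(
        lambda x: f(x) * math.exp(-x * x / (2.0 * alpha)) * gaussian_pdf(x, mu, lam),
        mu - 12.0 * sd_x, mu + 12.0 * sd_x)
    # y ~ N(W * mu, W * lam) is the perturbed Gaussian of the claim.
    sd_y = math.sqrt(W * lam)
    expect_y = simpson(
        lambda y: f(y) * gaussian_pdf(y, W * mu, W * lam),
        W * mu - 12.0 * sd_y, W * mu + 12.0 * sd_y)
    rhs = math.sqrt(W) * math.exp(-W * mu * mu / (2.0 * alpha)) * expect_y
    return lhs, rhs
```

Calling `claim8_sides(lambda x: x * x, 1.5, 2.0, 5.0)` should return two values that agree up to quadrature error; taking $f \equiv 1$ gives the scalar $\rho$ of Claim 6.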



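The proof of Lemma 7 is now straightforward.

Proof of Lemma 7. For simplicity of notation, we drop the subscript $i$ from $\rho_i$, $\mu_i$, $\Sigma_i$ with the
understanding that all statements of expectation apply to a single component. Using the notation
of Claim 8, we have

\[ \rho = E\left[\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \det(W)^{1/2}\exp\left(-\frac{\mu^TQWQ^T\mu}{2\alpha}\right). \]

A diagonal entry of the matrix $W$ can be expanded as

\[ \frac{\alpha}{\alpha+\lambda_i} = 1 - \frac{\lambda_i}{\alpha+\lambda_i} = 1 - \frac{\lambda_i}{\alpha} + \frac{\lambda_i^2}{\alpha(\alpha+\lambda_i)}, \]

so that

\[ W = I - \frac{1}{\alpha}\Lambda + \frac{1}{\alpha^2}W\Lambda^2. \]

Thus,

\[ E\left[x\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho\,(QWQ^T\mu) = \rho\left(QIQ^T\mu - \frac{1}{\alpha}Q\Lambda Q^T\mu + \frac{1}{\alpha^2}QW\Lambda^2Q^T\mu\right) = \rho\left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right), \]

where $\|f\| = O(\alpha^{-2})$.

We analyze the perturbed covariance in a similar fashion.

\[ E\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \rho\left(Q(W\Lambda)Q^T + QWQ^T\mu\mu^TQWQ^T\right) \]

\[ = \rho\left(Q\Lambda Q^T - \frac{1}{\alpha}Q\Lambda^2Q^T + \frac{1}{\alpha^2}QW\Lambda^3Q^T + \left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right)\left(\mu - \frac{1}{\alpha}\Sigma\mu + f\right)^T\right) \]

\[ = \rho\left(\Sigma + \mu\mu^T - \frac{1}{\alpha}\left(\Sigma\Sigma + \mu\mu^T\Sigma + \Sigma\mu\mu^T\right) + F\right), \]

where $\|F\| = O(\alpha^{-2})$.

□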
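The first-order expansion used in this proof can be checked in one dimension, where the normalized reweighted mean is exactly $W\mu$ and the remainder is $f = W\lambda^2\mu/\alpha^2$. The sketch below is illustrative only; the function name `first_moment_terms` and the test values are assumptions.

```python
def first_moment_terms(mu, lam, alpha):
    """One-dimensional version of the expansion in the proof of Lemma 7.

    Returns (exact, first_order, remainder) where
      exact       = W * mu with W = alpha / (alpha + lam),
      first_order = mu - lam * mu / alpha,
      remainder   = exact - first_order, which equals W * lam**2 * mu / alpha**2.
    """
    W = alpha / (alpha + lam)
    exact = W * mu
    first_order = mu - lam * mu / alpha
    return exact, first_order, exact - first_order
```

Multiplying the remainder by $\alpha^2$ gives a quantity bounded by $\lambda^2|\mu|$ for every $\alpha > 0$, which is the $O(\alpha^{-2})$ behavior claimed for $f$.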



3.3.2 Mixture moments 



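The second step is to approximate the first and second moments of the entire mixture distribution.
Let $\rho$ be the vector where $\rho_i = E_i\left[\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right]$ and let $\bar\rho$ be the average of the $\rho_i$. We also define

\[ u \equiv E\left[x\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \sum_{i=1}^k w_i\rho_i\mu_i - \frac{1}{\alpha}\sum_{i=1}^k w_i\rho_i\Sigma_i\mu_i + f \qquad (3) \]

\[ M \equiv E\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \sum_{i=1}^k w_i\rho_i\left(\Sigma_i + \mu_i\mu_i^T - \frac{1}{\alpha}\left(\Sigma_i\Sigma_i + \mu_i\mu_i^T\Sigma_i + \Sigma_i\mu_i\mu_i^T\right)\right) + F \qquad (4) \]

with $\|f\| = O(\alpha^{-2})$ and $\|F\| = O(\alpha^{-2})$. We denote the estimates of these quantities computed
from samples by $\hat u$ and $\hat M$ respectively.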



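Lemma 9. Let $v = \sum_{i=1}^k \rho_i w_i \mu_i$. Then

\[ \|u - v\|^2 \le \frac{4k^2}{\alpha^2 w}\phi. \]

Proof of Lemma 9. We argue from Eqn. 2 and Eqn. 3 that

\[ \|u - v\| = \left\|\frac{1}{\alpha}\sum_{i=1}^k w_i\rho_i\Sigma_i\mu_i\right\| + O(\alpha^{-2}) \]

\[ \le \frac{1}{\alpha\sqrt{w}}\sum_{i=1}^k \rho_i\left\|(w_i\Sigma_i)(\sqrt{w_i}\mu_i)\right\| + O(\alpha^{-2}) \]

\[ \le \frac{1}{\alpha\sqrt{w}}\sum_{i=1}^k \rho_i\left\|[A_i, B_i^T]^T\right\|\left\|\sqrt{w_i}\tilde\mu_i\right\| + O(\alpha^{-2}). \]

From isotropy, it follows that $\|\sqrt{w_i}\tilde\mu_i\| \le 1$. To bound the other factor, we argue

\[ \left\|[A_i, B_i^T]^T\right\| \le \sqrt{2}\max\{\|A_i\|, \|B_i\|\} \le \sqrt{2\phi}. \]

Therefore,

\[ \|u - v\|^2 \le \frac{2k^2}{\alpha^2 w}\phi + O(\alpha^{-3}) \le \frac{4k^2}{\alpha^2 w}\phi \]

for sufficiently large $n$, as $\alpha \ge n/w$.

□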
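Lemma 10. Let

\[ \Gamma = \begin{bmatrix} \sum_{i=1}^k \rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right) & 0 \\ 0 & \sum_{i=1}^k \rho_i D_i - \frac{\rho_i}{w_i\alpha}D_i^2 \end{bmatrix}. \]

If $\|\rho - 1\bar\rho\|_\infty < 1/(2\alpha)$, then

\[ \|M - \Gamma\|_2^2 \le \frac{16^2k^2}{w^2\alpha^2}\phi. \]

Before giving the proof, we summarize some of the necessary calculation in the following claim.

Claim 11. The matrix of second moments

\[ M = E\left[xx^T\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right] = \begin{bmatrix}\Gamma_{11} & 0 \\ 0 & \Gamma_{22}\end{bmatrix} + \begin{bmatrix}\Delta_{11} & \Delta_{21}^T \\ \Delta_{21} & \Delta_{22}\end{bmatrix} + F, \]

where

\[ \Gamma_{11} = \sum_{i=1}^k \rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right) \]

\[ \Gamma_{22} = \sum_{i=1}^k \rho_i D_i - \frac{\rho_i}{w_i\alpha}D_i^2 \]

\[ \Delta_{11} = -\sum_{i=1}^k \left[\frac{\rho_i}{w_i\alpha}B_i^TB_i + \frac{\rho_i}{w_i\alpha}\left(w_i\tilde\mu_i\tilde\mu_i^TA_i + w_iA_i\tilde\mu_i\tilde\mu_i^T + A_i^2\right)\right] \]

\[ \Delta_{21} = \sum_{i=1}^k \rho_iB_i - \frac{\rho_i}{w_i\alpha}\left(B_i(w_i\tilde\mu_i\tilde\mu_i^T) + B_iA_i + D_iB_i\right) \]

\[ \Delta_{22} = -\sum_{i=1}^k \frac{\rho_i}{w_i\alpha}B_iB_i^T, \]

and $\|F\| = O(\alpha^{-2})$.

Proof. The calculation is straightforward.

□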



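Proof of Lemma 10. We begin by bounding the 2-norm of each of the blocks. Since $\|w_i\tilde\mu_i\tilde\mu_i^T\| \le 1$
and $\|A_i\| \le \phi$ and $\|B_i\| \le \sqrt{\phi}$, we can bound

\[ \|\Delta_{11}\| = \max_{\|y\|=1}\sum_{i=1}^k \frac{\rho_i}{w_i\alpha}y^TB_i^TB_iy + \frac{\rho_i}{w_i\alpha}y^T\left(w_i\tilde\mu_i\tilde\mu_i^TA_i + w_iA_i\tilde\mu_i\tilde\mu_i^T + A_i^2\right)y + O(\alpha^{-2}) \]

\[ \le \sum_{i=1}^k \frac{\rho_i}{w_i\alpha}\|B_i\|^2 + \frac{\rho_i}{w_i\alpha}\left(2\|A_i\| + \|A_i\|^2\right) + O(\alpha^{-2}) \]

\[ \le \frac{4k}{w\alpha}\phi + O(\alpha^{-2}). \]

By a similar argument, $\|\Delta_{22}\| \le k\phi/(w\alpha) + O(\alpha^{-2})$. For $\Delta_{21}$, we observe that $\sum_{i=1}^k B_i = 0$.
Therefore,

\[ \|\Delta_{21}\| \le \left\|\sum_{i=1}^k(\rho_i - \bar\rho)B_i\right\| + \left\|\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(B_i(w_i\tilde\mu_i\tilde\mu_i^T) + B_iA_i + D_iB_i\right)\right\| + O(\alpha^{-2}) \]

\[ \le \sum_{i=1}^k|\rho_i - \bar\rho|\,\|B_i\| + \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(\|B_i(w_i\tilde\mu_i\tilde\mu_i^T)\| + \|B_iA_i\| + \|D_iB_i\|\right) + O(\alpha^{-2}) \]

\[ \le k\|\rho - 1\bar\rho\|_\infty\sqrt{\phi} + \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}\left(\sqrt{\phi} + \phi\sqrt{\phi} + \sqrt{\phi}\right) + O(\alpha^{-2}) \]

\[ \le k\|\rho - 1\bar\rho\|_\infty\sqrt{\phi} + \frac{3k}{w\alpha}\sqrt{\phi} + O(\alpha^{-2}) \]

\[ \le \frac{7k}{2w\alpha}\sqrt{\phi} + O(\alpha^{-2}). \]

Thus, we have $\max\{\|\Delta_{11}\|, \|\Delta_{22}\|, \|\Delta_{21}\|\} \le 4k\sqrt{\phi}/(w\alpha) + O(\alpha^{-2})$, so that

\[ \|M - \Gamma\| \le \|\Delta\| + O(\alpha^{-2}) \le 2\max\{\|\Delta_{11}\|, \|\Delta_{22}\|, \|\Delta_{21}\|\} \le \frac{8k}{w\alpha}\sqrt{\phi} + O(\alpha^{-2}) \le \frac{16k}{w\alpha}\sqrt{\phi} \]

for sufficiently large $n$, as $\alpha \ge n/w$. □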
3.4 Sample Convergence 

We now give some bounds on the convergence of the transformation to isotropy ($\hat\mu \to 0$ and $\hat\Sigma \to I$)
and on the convergence of the reweighted sample mean $\hat u$ and sample matrix of second moments
$\hat M$ to their expectations $u$ and $M$. For the convergence of second moment matrices, we use the
following lemma due to Rudelson [12], which was presented in this form in [13].

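Lemma 12. Let $y$ be a random vector from a distribution $D$ in $\mathbb{R}^n$, with $\sup_D \|y\| = M$ and
$\|E(yy^T)\| \le 1$. Let $y_1, \ldots, y_m$ be independent samples from $D$. Let

\[ \eta = CM\sqrt{\frac{\log m}{m}}, \]

where $C$ is an absolute constant. Then,

(i) If $\eta < 1$, then

\[ E\left(\left\|\frac{1}{m}\sum_{i=1}^m y_iy_i^T - E(yy^T)\right\|\right) \le \eta. \]

(ii) For every $t \in (0, 1)$,

\[ P\left(\left\|\frac{1}{m}\sum_{i=1}^m y_iy_i^T - E(yy^T)\right\| > t\right) \le 2e^{-ct^2/\eta^2}. \]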

This lemma is used to show that a distribution can be made nearly isotropic using only $O^*(kn)$
samples [12, 10]. The isotropic transformation is computed simply by estimating the mean and
covariance matrix of a sample, and computing the affine transformation that puts the sample in
isotropic position.

Theorem 4. There is an absolute constant $C$ such that for an isotropic mixture of $k$ logconcave
distributions, with probability at least $1 - \delta$, a sample of size

\[ m > C\,\frac{kn\log^2(n/\delta)}{\epsilon^2} \]

gives a sample mean $\hat\mu$ and sample covariance $\hat\Sigma$ so that

\[ \|\hat\mu\| \le \epsilon \quad\text{and}\quad \|\hat\Sigma - I\| \le \epsilon. \]

We now consider the reweighted moments.

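Lemma 13. Let $\epsilon, \delta > 0$ and let $\hat u$ be the reweighted sample mean of a set of $m$ points drawn from
an isotropic mixture of $k$ Gaussians in $n$ dimensions, where

\[ m \ge \frac{2n\alpha}{\epsilon^2}\log\frac{2n}{\delta}. \]

Then

\[ P\left[\|\hat u - u\| > \epsilon\right] \le \delta. \]

Proof. We first consider only a single coordinate of the vector $\hat u$. Let $y = x_1\exp(-\|x\|^2/(2\alpha)) - u_1$.
We observe that

\[ \left|x_1\exp\left(-\frac{\|x\|^2}{2\alpha}\right)\right| \le |x_1|\exp\left(-\frac{x_1^2}{2\alpha}\right) \le \sqrt{\frac{\alpha}{e}} < \sqrt{\alpha}. \]

Thus, each term in the sum $m(\hat u_1 - u_1) = \sum_{j=1}^m y_j$ falls in the range $[-\sqrt{\alpha} - u_1, \sqrt{\alpha} - u_1]$. We may therefore
apply Hoeffding's inequality to show that

\[ P\left[|\hat u_1 - u_1| \ge \epsilon/\sqrt{n}\right] \le 2\exp\left(-\frac{2m^2(\epsilon/\sqrt{n})^2}{m(2\sqrt{\alpha})^2}\right) \le 2\exp\left(-\frac{m\epsilon^2}{2\alpha n}\right) \le \frac{\delta}{n}. \]

Taking the union bound over the $n$ coordinates, we have that with probability $1 - \delta$ the error in
each coordinate is at most $\epsilon/\sqrt{n}$, which implies that $\|\hat u - u\| \le \epsilon$. □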
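The arithmetic in this proof can be checked with a short sketch. It computes the sample size from the statement of Lemma 13 and evaluates the resulting union bound $2n\exp(-m\epsilon^2/(2\alpha n))$; the function names and the example parameter values are assumptions chosen for illustration.

```python
import math

def lemma13_sample_size(n, alpha, eps, delta):
    """Smallest integer m with m >= (2 n alpha / eps^2) * log(2 n / delta)."""
    return math.ceil(2.0 * n * alpha / eps ** 2 * math.log(2.0 * n / delta))

def hoeffding_union_bound(m, n, alpha, eps):
    """Union bound over n coordinates of the per-coordinate Hoeffding tail."""
    return n * 2.0 * math.exp(-m * eps ** 2 / (2.0 * alpha * n))

def max_weighted_coordinate(alpha, grid=100000, span=20.0):
    """Grid estimate of sup_x |x| * exp(-x^2 / (2 alpha)) over [0, span * sqrt(alpha)]."""
    best = 0.0
    for j in range(grid + 1):
        x = span * math.sqrt(alpha) * j / grid
        best = max(best, x * math.exp(-x * x / (2.0 * alpha)))
    return best
```

With the sample size returned by `lemma13_sample_size`, the union bound is at most $\delta$, and the grid maximum never exceeds $\sqrt{\alpha/e}$, the value attained at $x = \sqrt{\alpha}$.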

Lemma 14. Let $\epsilon, \delta > 0$ and let $\hat M$ be the reweighted sample matrix of second moments for a set
of $m$ points drawn from an isotropic mixture of $k$ Gaussians in $n$ dimensions, where

\[ m \ge C_1\frac{n\alpha}{\epsilon^2}\log\frac{n\alpha}{\delta} \]

and $C_1$ is an absolute constant. Then

\[ P\left[\|\hat M - M\| > \epsilon\right] < \delta. \]

Proof. We will apply Lemma 12. Define $y = x\exp(-\|x\|^2/(2\alpha))$. Then,

\[ y_i^2 \le x_i^2\exp\left(-\frac{\|x\|^2}{\alpha}\right) \le x_i^2\exp\left(-\frac{x_i^2}{\alpha}\right) \le \frac{\alpha}{e} \le \alpha. \]

Therefore $\|y\| \le \sqrt{\alpha n}$.

Next, since the mixture $\mathcal{F}$ is in isotropic position (we can assume this w.l.o.g.), we have for any unit vector $v$,

\[ E((v^Ty)^2) \le E((v^Tx)^2) \le 1 \]

and so $\|E(yy^T)\| \le 1$.

Now we apply the second part of Lemma 12 with $\eta = \epsilon\sqrt{c/\ln(2/\delta)}$ and $t = \eta\sqrt{\ln(2/\delta)/c}$. This
requires that

\[ \eta = \frac{\sqrt{c}\,\epsilon}{\sqrt{\ln(2/\delta)}} \ge C\sqrt{\alpha n}\sqrt{\frac{\log m}{m}}, \]

which is satisfied for our choice of $m$. □

Lemma 15. Let $X$ be a collection of $m$ points drawn from a Gaussian with mean $\mu$ and variance
$\sigma^2$. With probability $1 - \delta$,

\[ |x - \mu| \le \sigma\sqrt{2\log(m/\delta)} \]

for every $x \in X$.



3.5 Perturbation Lemma 

We will use the following key lemma due to Stewart [16] to show that when we apply the spectral
step, the top $k - 1$ dimensional invariant subspace will be close to the Fisher subspace.

Lemma 16 (Stewart's Theorem). Suppose $A$ and $A + E$ are $n$-by-$n$ symmetric matrices and that

\[ A = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}, \qquad E = \begin{bmatrix} E_{11} & E_{21}^T \\ E_{21} & E_{22} \end{bmatrix}, \]

where in both matrices the first block row and block column have size $r$ and the second have size $n - r$.
Let the columns of $V$ be the top $r$ eigenvectors of the matrix $A + E$ and let $P_2$ be the matrix with
columns $e_{r+1}, \ldots, e_n$.

If $d = \lambda_r(D_1) - \lambda_1(D_2) > 0$ and

\[ \|E\| \le \frac{d}{5}, \]

then

\[ \|V^TP_2\| \le \frac{4}{d}\|E_{21}\|_2. \]
a 



4 Finding a Vector near the Fisher Subspace 

In this section, we combine the approximations of Section 3.3 and the perturbation lemma of
Section 3.5 to show that the direction $h$ chosen by step 3 of the algorithm is close to the intermean
subspace. Section 5 argues that this direction can be used to partition the components. Finding
the separating direction is the most challenging part of the classification task and represents the
main contribution of this work.

We first assume zero overlap and that the sample reweighted moments behave exactly according
to expectation. In this case, the mean shift $\hat u$ becomes

\[ \hat u = \sum_{i=1}^k w_i\rho_i\mu_i. \]

We can intuitively think of the components that have greater $\rho_i$ as gaining mixing weight and
those with smaller $\rho_i$ as losing mixing weight. As long as the $\rho_i$ are not all equal, we will observe
some shift of the mean in the intermean subspace, i.e. Fisher subspace. Therefore, we may use
this direction to partition the components. On the other hand, if all of the $\rho_i$ are equal, then $\hat M$
becomes

\[ \hat M = \begin{bmatrix} \sum_{i=1}^k \rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right) & 0 \\ 0 & \sum_{i=1}^k \rho_iD_i - \frac{\rho_i}{w_i\alpha}D_i^2 \end{bmatrix} = \bar\rho\begin{bmatrix} I & 0 \\ 0 & I - \frac{1}{\alpha}\sum_{i=1}^k\frac{1}{w_i}D_i^2 \end{bmatrix}. \]

Notice that the second moments in the subspace $\mathrm{span}\{e_1, \ldots, e_{k-1}\}$ are maintained while those in
the complementary subspace are reduced by $\mathrm{poly}(1/\alpha)$. Therefore, the top eigenvector will be in
the intermean subspace, which is the Fisher subspace.

We now argue that this same strategy can be adapted to work in general, i.e., with nonzero
overlap and sampling errors, with high probability. A critical aspect of this argument is that the
norm of the error term $\hat M - \Gamma$ depends only on $\phi$ and $k$ and not the dimension of the data. See
Lemma 10 and the supporting Lemma 4 and Fact 3.

Since we cannot know directly how imbalanced the $\rho_i$ are, we choose the method of finding a
separating direction according to the norm of the vector $\hat u$. Recall that when $\|\hat u\| > \sqrt{w}/(32\alpha)$ the
algorithm uses $\hat u$ to determine the separating direction $h$. Lemma 17 guarantees that this vector
is close to the Fisher subspace. When $\|\hat u\| \le \sqrt{w}/(32\alpha)$, the algorithm uses the top eigenvector of
the covariance matrix $\hat M$. Lemma 18 guarantees that this vector is close to the Fisher subspace.
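The case distinction just described can be sketched as follows. This is a simplified illustration rather than the full algorithm: the names `choose_direction` and `top_eigenvector` are hypothetical, the top eigenvector is approximated by plain power iteration (an assumption; any symmetric eigensolver would do), and the inputs are assumed to be the reweighted sample mean and second moment matrix of isotropic data.

```python
import math

def top_eigenvector(M, iters=500):
    """Approximate the top eigenvector of a symmetric PSD matrix by power iteration.

    Assumes the uniform start vector is not orthogonal to the top eigenvector.
    """
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def choose_direction(u_hat, M_hat, w_min, alpha):
    """Return (method, h): use the mean shift if it is large, else the spectral step.

    The threshold sqrt(w) / (32 * alpha) is the one quoted in the text.
    """
    norm_u = math.sqrt(sum(c * c for c in u_hat))
    threshold = math.sqrt(w_min) / (32.0 * alpha)
    if norm_u > threshold:
        return "mean_shift", [c / norm_u for c in u_hat]
    return "spectral", top_eigenvector(M_hat)
```

For example, a mean shift of `[0.5, 0.0]` with `w_min = 0.25` and `alpha = 4.0` exceeds the threshold and selects the mean shift direction, whereas a zero mean shift falls back to the top eigenvector of the second moment matrix.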

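Lemma 17 (Mean Shift Method). Let $\epsilon > 0$. There exists a constant $C$ such that if $m_1 \ge
C n \cdot \mathrm{poly}(k, w^{-1}, \log n/\delta)$, then the following holds with probability $1 - \delta$. If $\|\hat u\| > \sqrt{w}/(32\alpha)$ and

\[ \phi \le \frac{w^2\epsilon}{2^{14}k^2}, \]

then

\[ \frac{|\hat u^Tv|}{\|\hat u\|\|v\|} \ge 1 - \epsilon. \]

Lemma 18 (Spectral Method). Let $\epsilon > 0$. There exists a constant $C$ such that if $m_1 \ge C n^4 \cdot \mathrm{poly}(k, w^{-1}, \log n/\delta)$,
then the following holds with probability $1 - \delta$. Let $v_1, \ldots, v_{k-1}$ be the top $k - 1$ eigenvectors of $\hat M$.
If $\|\hat u\| \le \sqrt{w}/(32\alpha)$ and

\[ \phi \le \frac{w^2\epsilon}{640^2k^2}, \]

then

\[ \min_{v \in \mathrm{span}\{v_1, \ldots, v_{k-1}\},\ \|v\| = 1} \|\mathrm{proj}_F(v)\| \ge 1 - \epsilon. \]

4.1 Mean Shift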

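Proof of Lemma 17. We will make use of the following claim.

Claim 19. For any vectors $a, b \ne 0$,

\[ \frac{|a^Tb|}{\|a\|\|b\|} \ge \left(1 - \frac{\|a - b\|^2}{\max\{\|a\|^2, \|b\|^2\}}\right)^{1/2}. \]

By the triangle inequality, $\|\hat u - v\| \le \|\hat u - u\| + \|u - v\|$. By Lemma 9,

\[ \|u - v\| \le \sqrt{\frac{4k^2}{\alpha^2w}\phi} \le \sqrt{\frac{4k^2}{\alpha^2w}\cdot\frac{w^2\epsilon}{2^{14}k^2}} = \sqrt{\frac{w\epsilon}{2^{12}\alpha^2}}. \]

By Lemma 13, for large $m_1$ we obtain the same bound on $\|\hat u - u\|$ with probability $1 - \delta$. Thus,

\[ \|\hat u - v\| \le \sqrt{\frac{w\epsilon}{2^{10}\alpha^2}}. \]

Applying the claim gives

\[ \frac{|\hat u^Tv|}{\|\hat u\|\|v\|} \ge \left(1 - \frac{\|\hat u - v\|^2}{\|\hat u\|^2}\right)^{1/2} \ge 1 - \frac{\|\hat u - v\|^2}{\|\hat u\|^2} \ge 1 - \frac{w\epsilon}{2^{10}\alpha^2}\cdot\frac{32^2\alpha^2}{w} = 1 - \epsilon. \]

□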



Proof of Claim 19. Without loss of generality, assume $\|u\| \ge \|v\|$ and fix the distance $\|u - v\|$. In
order to maximize the angle between $u$ and $v$, the vector $v$ should be chosen so that it is tangent to
the sphere centered at $u$ with radius $\|u - v\|$. Hence, the vectors $u, v, (u - v)$ form a right triangle
where $\|u\|^2 = \|v\|^2 + \|u - v\|^2$. For this choice of $v$, let $\theta$ be the angle between $u$ and $v$ so that

\[ \frac{u^Tv}{\|u\|\|v\|} = \cos\theta = (1 - \sin^2\theta)^{1/2} = \left(1 - \frac{\|u - v\|^2}{\|u\|^2}\right)^{1/2}. \]



□ 
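Claim 19 is easy to probe numerically. The sketch below evaluates both sides of the inequality for a pair of vectors; the function name `claim19_holds` and the small floating-point tolerance are assumptions, and when the quantity under the square root is negative the bound is treated as vacuous.

```python
import math
import random

def claim19_holds(a, b, tol=1e-9):
    """Check |a.b| / (|a| |b|) >= sqrt(1 - |a-b|^2 / max(|a|^2, |b|^2))."""
    dot = sum(x * y for x, y in zip(a, b))
    na2 = sum(x * x for x in a)
    nb2 = sum(y * y for y in b)
    diff2 = sum((x - y) ** 2 for x, y in zip(a, b))
    lhs = abs(dot) / math.sqrt(na2 * nb2)
    inside = 1.0 - diff2 / max(na2, nb2)
    rhs = math.sqrt(inside) if inside > 0.0 else 0.0
    return lhs >= rhs - tol

def random_check(trials=1000, dim=5, seed=0):
    """Test the claim on random Gaussian vector pairs."""
    rng = random.Random(seed)
    return all(
        claim19_holds([rng.gauss(0, 1) for _ in range(dim)],
                      [rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(trials))
```

The pair `a = (1, 0)`, `b = (1, 0.5)` is a tight case: there $a$ and $b - a$ are orthogonal, matching the right-triangle configuration in the proof.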



4.2 Spectral Method 

We first show that the smallness of the mean shift $\hat u$ implies that the coefficients $\rho_i$ are sufficiently
uniform to allow us to apply the spectral method.

Claim 20 (Small Mean Shift Implies Balanced Second Moments). If $\|\hat u\| \le \sqrt{w}/(32\alpha)$ and

\[ \sqrt{\phi} \le \frac{w}{64k}, \]

then

\[ \|\rho - 1\bar\rho\|_\infty \le \frac{1}{8\alpha}. \]

Proof. Let $q_1, \ldots, q_k$ be the right singular vectors of the matrix $U = [w_1\mu_1, \ldots, w_k\mu_k]$ and let $\sigma_i(U)$
be the $i$th largest singular value. Because $\sum_{i=1}^k w_i\mu_i = 0$, we have that $\sigma_k(U) = 0$ and $q_k = 1/\sqrt{k}$.
Recall that $\rho$ is the $k$-vector of scalars $\rho_1, \ldots, \rho_k$ and that $v = U\rho$. Then

\[ \|v\|^2 = \|U\rho\|^2 = \sum_{i=1}^{k-1}\sigma_i(U)^2(q_i^T\rho)^2 \ge \sigma_{k-1}(U)^2\|\rho - q_k(q_k^T\rho)\|_2^2 = \sigma_{k-1}(U)^2\|\rho - 1\bar\rho\|_2^2. \]

Because $q_{k-1} \in \mathrm{span}\{\mu_1, \ldots, \mu_k\}$, we have that $\sum_{i=1}^k w_iq_{k-1}^T\mu_i\mu_i^Tq_{k-1} \ge 1 - \phi$. Therefore,

\[ \sigma_{k-1}(U)^2 = \|Uq_{k-1}\|^2 = q_{k-1}^T\left(\sum_{i=1}^k w_i^2\mu_i\mu_i^T\right)q_{k-1} \ge w\,q_{k-1}^T\left(\sum_{i=1}^k w_i\mu_i\mu_i^T\right)q_{k-1} \ge w(1 - \phi). \]

Thus, we have the bound

\[ \|\rho - 1\bar\rho\|_\infty \le \frac{1}{\sqrt{(1-\phi)w}}\|v\| \le \frac{2}{\sqrt{w}}\|v\|. \]

By the triangle inequality $\|v\| \le \|\hat u\| + \|\hat u - v\|$. As argued in Lemma 9,

\[ \|\hat u - v\| \le \sqrt{\frac{4k^2}{\alpha^2w}\phi} \le \sqrt{\frac{4k^2}{\alpha^2w}\cdot\frac{w^2}{64^2k^2}} = \frac{\sqrt{w}}{32\alpha}. \]

Thus,

\[ \|\rho - 1\bar\rho\|_\infty \le \frac{2}{\sqrt{w}}\|v\| \le \frac{2}{\sqrt{w}}\left(\frac{\sqrt{w}}{32\alpha} + \frac{\sqrt{w}}{32\alpha}\right) = \frac{1}{8\alpha}. \]

□



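We next show that the top $k - 1$ principal components of $\Gamma$ span the intermean subspace and
put a lower bound on the spectral gap between the intermean and non-intermean components.

Lemma 21 (Ideal Case). If $\|\rho - 1\bar\rho\|_\infty \le 1/(8\alpha)$, then

\[ \lambda_{k-1}(\Gamma) - \lambda_k(\Gamma) \ge \frac{1}{4\alpha} \]

and the top $k - 1$ eigenvectors of $\Gamma$ span the means of the components.

Proof of Lemma 21. We first bound $\lambda_{k-1}(\Gamma_{11})$. Recall that

\[ \Gamma_{11} = \sum_{i=1}^k \rho_i\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right). \]

Thus,

\[ \lambda_{k-1}(\Gamma_{11}) = \min_{\|y\|=1}\sum_{i=1}^k \rho_i\,y^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y \ge \bar\rho - \max_{\|y\|=1}\sum_{i=1}^k(\bar\rho - \rho_i)\,y^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y. \]

We observe that $\sum_{i=1}^k y^T(w_i\tilde\mu_i\tilde\mu_i^T + A_i)y = 1$ and each term is non-negative. Hence the sum is
bounded by

\[ \sum_{i=1}^k(\bar\rho - \rho_i)\,y^T\left(w_i\tilde\mu_i\tilde\mu_i^T + A_i\right)y \le \|\rho - 1\bar\rho\|_\infty, \]

so,

\[ \lambda_{k-1}(\Gamma_{11}) \ge \bar\rho - \|\rho - 1\bar\rho\|_\infty. \]

Next, we bound $\lambda_1(\Gamma_{22})$. Recall that

\[ \Gamma_{22} = \sum_{i=1}^k \rho_iD_i - \frac{\rho_i}{w_i\alpha}D_i^2 \]

and that for any $n - k$ vector $y$ such that $\|y\| = 1$, we have $\sum_{i=1}^k y^TD_iy = 1$. Using the same
arguments as above,

\[ \lambda_1(\Gamma_{22}) = \max_{\|y\|=1}\ \bar\rho + \sum_{i=1}^k(\rho_i - \bar\rho)\,y^TD_iy - \frac{\rho_i}{w_i\alpha}y^TD_i^2y \le \bar\rho + \|\rho - 1\bar\rho\|_\infty - \min_{\|y\|=1}\sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y. \]

To bound the last sum, we observe that $\rho_i - \bar\rho = O(\alpha^{-1})$. Therefore

\[ \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y \ge \frac{\bar\rho}{\alpha}\sum_{i=1}^k\frac{1}{w_i}y^TD_i^2y + O(\alpha^{-2}). \]

Without loss of generality, we may assume that $y = e_1$ by an appropriate rotation of the $D_i$. Let
$D_i(\ell, j)$ be the element in the $\ell$th row and $j$th column of the matrix $D_i$. Then the sum becomes

\[ \sum_{i=1}^k\frac{1}{w_i}y^TD_i^2y = \sum_{i=1}^k\frac{1}{w_i}\sum_j D_i(1, j)^2 \ge \sum_{i=1}^k\frac{1}{w_i}D_i(1, 1)^2. \]

Because $\sum_{i=1}^k D_i = I$, we have $\sum_{i=1}^k D_i(1, 1) = 1$. From the Cauchy-Schwarz inequality, it follows

\[ \left(\sum_{i=1}^k w_i\right)^{1/2}\left(\sum_{i=1}^k\frac{1}{w_i}D_i(1, 1)^2\right)^{1/2} \ge \sum_{i=1}^k\sqrt{w_i}\cdot\frac{D_i(1, 1)}{\sqrt{w_i}} = 1. \]

Since $\sum_{i=1}^k w_i = 1$, we conclude that $\sum_{i=1}^k\frac{1}{w_i}D_i(1, 1)^2 \ge 1$. Thus, using the fact that $\bar\rho \ge 1/2$, we
have

\[ \sum_{i=1}^k\frac{\rho_i}{w_i\alpha}y^TD_i^2y \ge \frac{1}{2\alpha}. \]

Putting the bounds together,

\[ \lambda_{k-1}(\Gamma_{11}) - \lambda_1(\Gamma_{22}) \ge \frac{1}{2\alpha} - 2\|\rho - 1\bar\rho\|_\infty \ge \frac{1}{4\alpha}. \]

□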
Proof of Lemma 18. To bound the effect of overlap and sample errors on the eigenvectors, we apply
Stewart's Lemma (Lemma 16). Define $d = \lambda_{k-1}(\Gamma) - \lambda_k(\Gamma)$ and $E = \hat M - \Gamma$.

We assume that the mean shift satisfies $\|\hat u\| \le \sqrt{w}/(32\alpha)$ and that $\phi$ is small. By Lemma 21
this implies that

\[ d = \lambda_{k-1}(\Gamma) - \lambda_k(\Gamma) \ge \frac{1}{4\alpha}. \qquad (5) \]

To bound $\|E\|$, we use the triangle inequality $\|E\| \le \|\Gamma - M\| + \|M - \hat M\|$. Lemma 10 bounds
the first term by

\[ \|M - \Gamma\| \le \sqrt{\frac{16^2k^2}{w^2\alpha^2}\phi} \le \sqrt{\frac{16^2k^2}{w^2\alpha^2}\cdot\frac{w^2\epsilon}{640^2k^2}} = \frac{1}{40\alpha}\sqrt{\epsilon}. \]

By Lemma 14 we obtain the same bound on $\|M - \hat M\|$ with probability $1 - \delta$ for large enough $m_1$.
Thus,

\[ \|E\| \le \frac{1}{20\alpha}\sqrt{\epsilon}. \qquad (6) \]

Combining the bounds of Eqn. 5 and Eqn. 6, we have

\[ \sqrt{1 - (1 - \epsilon)^2}\,d - 5\|E\| \ge \sqrt{1 - (1 - \epsilon)^2}\,\frac{1}{4\alpha} - \frac{5}{20\alpha}\sqrt{\epsilon} \ge 0, \]

as $\sqrt{1 - (1 - \epsilon)^2} \ge \sqrt{\epsilon}$. This implies both that $\|E\| \le d/5$ and that $4\|E_{21}\|/d \le \sqrt{1 - (1 - \epsilon)^2}$,
enabling us to apply Stewart's Lemma to the matrix pair $\Gamma$ and $\hat M$.

By Lemma 21, the top $k - 1$ eigenvectors of $\Gamma$, i.e. $e_1, \ldots, e_{k-1}$, span the means of the components.
Let the columns of $P_1$ be these eigenvectors. Let the columns of $P_2$ be defined such that
$[P_1, P_2]$ is an orthonormal matrix and let $v_1, \ldots, v_{k-1}$ be the top $k - 1$ eigenvectors of $\hat M$. By Stewart's
Lemma, letting the columns of $V$ be $v_1, \ldots, v_{k-1}$, we have

\[ \|V^TP_2\|_2 \le \sqrt{1 - (1 - \epsilon)^2}, \]

or equivalently,

\[ \min_{v \in \mathrm{span}\{v_1, \ldots, v_{k-1}\},\ \|v\| = 1}\|\mathrm{proj}_F v\| = \sigma_{k-1}(V^TP_1) \ge 1 - \epsilon. \]

□



5 Recursion 

In this section, we show that for every direction $h$ that is close to the intermean subspace, the
"largest gap clustering" step produces a pair of complementary halfspaces that partitions $\mathbb{R}^n$ while
leaving only a small part of the probability mass on the wrong side of the partition, small enough
that with high probability, it does not affect the samples used by the algorithm.

Lemma 22. Let $\delta, \delta' > 0$, where $\delta' \le \delta/(2m_2)$, and let $m_2$ satisfy $m_2 \ge (k/w)\log(2k/\delta)$. Suppose
that $h$ is a unit vector such that

\[ \|\mathrm{proj}_F(h)\| \ge 1 - \frac{w}{2^{10}(k-1)^2\log\frac{1}{\delta'}}. \]

Let $\mathcal{F}$ be a mixture of $k > 1$ Gaussians with overlap

\[ \phi \le \frac{w}{2^9(k-1)^2}\log^{-1}\frac{1}{\delta'}. \]

Let $X$ be a collection of $m_2$ points from $\mathcal{F}$ and let $t$ be the midpoint of the largest gap in the set
$\{h^Tx : x \in X\}$. With probability $1 - \delta$, the halfspace $H_{h,t}$ has the following property. For a random
sample $y$ from $\mathcal{F}$ either

\[ y, \mu_{\ell(y)} \in H_{h,t} \quad\text{or}\quad y, \mu_{\ell(y)} \notin H_{h,t} \]

with probability $1 - \delta'$.

Proof of Lemma 22. The idea behind the proof is simple. We first show that two of the means are
at least a constant distance apart. We then bound the width of a component along the direction
$h$, i.e. the maximum distance between two points belonging to the same component. If the width
of each component is small, then clearly the largest gap must fall between components. Setting $t$
to be the midpoint of the gap, we avoid cutting any components.

We first show that at least one mean must be far from the origin in the direction $h$. Let the
columns of $P_1$ be the vectors $e_1, \ldots, e_{k-1}$. The span of these vectors is also the span of the means,
so we have

\[ \max_i (h^T\mu_i)^2 = \max_i (h^TP_1P_1^T\mu_i)^2 \]

\[ = \|P_1^Th\|^2\max_i\left(\frac{(P_1^Th)^T}{\|P_1^Th\|}\tilde\mu_i\right)^2 \]

\[ \ge \|P_1^Th\|^2\sum_{i=1}^k w_i\left(\frac{(P_1^Th)^T}{\|P_1^Th\|}\tilde\mu_i\right)^2 \]

\[ \ge \|P_1^Th\|^2(1 - \phi) \]

\[ > \frac{1}{2}. \]

Since the origin is the mean of the means, we conclude that the maximum distance between two
means in the direction $h$ is at least $1/2$. Without loss of generality, we assume that the interval
$[0, 1/2]$ is contained between two means projected to $h$.

We now show that every point $x$ drawn from component $i$ falls in a narrow interval when
projected to $h$. That is, $x$ satisfies $h^Tx \in b_i$, where $b_i = [h^T\mu_i - (8(k-1))^{-1}, h^T\mu_i + (8(k-1))^{-1}]$.
We begin by examining the variance along $h$. Let $e_k, \ldots, e_n$ be the columns of the $n$-by-$(n - k + 1)$
matrix $P_2$. Recall from Eqn. 2 that $P_1^Tw_i\Sigma_iP_1 = A_i$, that $P_2^Tw_i\Sigma_iP_1 = B_i$, and that
$P_2^Tw_i\Sigma_iP_2 = D_i$. The norms of these matrices are bounded according to Lemma 4. Also, the vector
$h = P_1P_1^Th + P_2P_2^Th$. For convenience of notation we define $\epsilon$ such that $\|P_1^Th\| = 1 - \epsilon$. Then
$\|P_2^Th\|^2 = 1 - (1 - \epsilon)^2 \le 2\epsilon$. We now argue

\[ h^Tw_i\Sigma_ih \le \left(h^TP_1A_iP_1^Th + 2h^TP_2B_iP_1^Th + h^TP_2D_iP_2^Th\right) \]

\[ \le 2\left(h^TP_1A_iP_1^Th + h^TP_2D_iP_2^Th\right) \]

\[ \le 2\left(\|P_1^Th\|^2\|A_i\| + \|P_2^Th\|^2\|D_i\|\right) \]

\[ \le 2(\phi + 2\epsilon). \]

Using the assumptions about $\phi$ and $\epsilon$, we conclude that the maximum variance along $h$ is at most

\[ \max_i h^T\Sigma_ih \le \frac{2}{w}\left(\frac{w}{2^9(k-1)^2}\log^{-1}\frac{1}{\delta'} + 2\cdot\frac{w}{2^{10}(k-1)^2}\log^{-1}\frac{1}{\delta'}\right) \le \left(2^7(k-1)^2\log(1/\delta')\right)^{-1}. \]

We now translate these bounds on the variance to a bound on the difference between the
minimum and maximum points along the direction $h$. By Lemma 15, with probability $1 - \delta/2$,

\[ |h^T(x - \mu_{\ell(x)})| \le \sqrt{2h^T\Sigma_ih\log(2m_2/\delta)} \le \frac{1}{8(k-1)}\sqrt{\frac{\log(2m_2/\delta)}{\log(1/\delta')}} \le \frac{1}{8(k-1)}. \]

Thus, with probability $1 - \delta/2$, every point from $X$ falls into the union of intervals $b_1 \cup \ldots \cup b_k$
where $b_i = [h^T\mu_i - (8(k-1))^{-1}, h^T\mu_i + (8(k-1))^{-1}]$. Because these intervals are centered about
the means, at least the equivalent of one interval must fall outside the range $[0, 1/2]$, which we
assumed was contained between two projected means. Thus, the measure of the subset of $[0, 1/2]$ that
does not fall into one of the intervals is at least

\[ \frac{1}{2} - (k-1)\frac{1}{4(k-1)} = \frac{1}{4}. \]

This set can be cut into at most $k - 1$ intervals, so at least one of these gaps has width at least
$(4(k-1))^{-1}$, which is exactly the width of an interval.

Because $m_2 \ge (k/w)\log(2k/\delta)$, the set $X$ contains at least one sample from every component with
probability $1 - \delta/2$. Overall, with probability $1 - \delta$ every component has at least one sample and
all samples from component $i$ fall in $b_i$. Thus, the largest gap between the sampled points will not
contain one of the intervals $b_1, \ldots, b_k$. Moreover, the midpoint $t$ of this gap must also fall outside
of $b_1 \cup \ldots \cup b_k$, ensuring that no $b_i$ is cut by $t$.

By the same argument given above, any single point $y$ from $\mathcal{F}$ is contained in $b_1 \cup \ldots \cup b_k$ with
probability $1 - \delta'$, proving the Lemma. □
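The "largest gap clustering" step analyzed in this proof can be sketched in a few lines: project the points onto $h$, sort the projections, and place the threshold $t$ at the midpoint of the widest gap. The function names below are hypothetical, and the sketch assumes at least two points with distinct projections.

```python
def largest_gap_threshold(points, h):
    """Return the midpoint t of the largest gap in {h^T x : x in points}."""
    projections = sorted(sum(hc * xc for hc, xc in zip(h, x)) for x in points)
    best_gap = -1.0
    t = None
    for left, right in zip(projections, projections[1:]):
        if right - left > best_gap:
            best_gap = right - left
            t = (left + right) / 2.0
    return t

def in_halfspace(x, h, t):
    """Membership test for the halfspace {x : h^T x >= t}."""
    return sum(hc * xc for hc, xc in zip(h, x)) >= t
```

For points whose projections onto `h = [1.0, 0.0]` are 0, 0.1, 0.2, 2.0 and 2.1, the widest gap lies between 0.2 and 2.0, so the threshold is 1.1 and the first three points fall on one side of it.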

In the proof of the main theorem for large $k$, we will need to have every point sampled from
$\mathcal{F}$ in the recursion subtree classified correctly by the halfspace, so we will assume $\delta'$ considerably
smaller than $\delta/m_2$.

The second lemma shows that all submixtures have smaller overlap to ensure that all the relevant
lemmas apply in the recursive steps.

Lemma 23. The removal of any subset of components cannot induce a mixture with greater overlap
than the original.

Proof of Lemma 23. Suppose that the components $j + 1, \ldots, k$ are removed from the mixture. Let
$\omega = \sum_{i=1}^j w_i$ be a normalizing factor for the weights. Then if $c = \sum_{i=1}^j w_i\mu_i = -\sum_{i=j+1}^k w_i\mu_i$, the
induced mean is $\omega^{-1}c$. Let $T$ be the subspace that minimizes the maximum overlap for the full $k$
component mixture. We then argue that the overlap $\tilde\phi$ of the induced mixture is bounded by

\[ \tilde\phi = \min_{\dim(S) = j-1}\ \max_{v \in S}\ \frac{\omega^{-1}\sum_{i=1}^j w_iv^T\Sigma_iv}{\omega^{-1}\sum_{i=1}^j w_iv^T\left(\mu_i\mu_i^T - cc^T + \Sigma_i\right)v} \]

\[ \le \max_{v \in \mathrm{span}\{e_1, \ldots, e_{k-1}\}\setminus\mathrm{span}\{\mu_{j+1}, \ldots, \mu_k\}}\ \frac{\sum_{i=1}^j w_iv^T\Sigma_iv}{\sum_{i=1}^j w_iv^T\left(\mu_i\mu_i^T - cc^T + \Sigma_i\right)v}. \]

Every $v \in \mathrm{span}\{e_1, \ldots, e_{k-1}\}\setminus\mathrm{span}\{\mu_{j+1}, \ldots, \mu_k\}$ must be orthogonal to every $\mu_\ell$ for $j + 1 \le \ell \le k$.
Therefore, $v$ must be orthogonal to $c$ as well. This also enables us to add the terms for $j + 1, \ldots, k$
in both the numerator and denominator, because they are all zero.

\[ \tilde\phi \le \max_{v \in \mathrm{span}\{e_1, \ldots, e_{k-1}\}\setminus\mathrm{span}\{\mu_{j+1}, \ldots, \mu_k\}}\ \frac{\sum_{i=1}^k w_iv^T\Sigma_iv}{\sum_{i=1}^k w_iv^T\left(\mu_i\mu_i^T + \Sigma_i\right)v} \]

\[ \le \max_{v \in \mathrm{span}\{e_1, \ldots, e_{k-1}\}}\ \frac{\sum_{i=1}^k w_iv^T\Sigma_iv}{\sum_{i=1}^k w_iv^T\left(\mu_i\mu_i^T + \Sigma_i\right)v} = \phi. \]

□
The proofs of the main theorems are now apparent. Consider the case of $k = 2$ Gaussians
first. As argued in Section 3.4, using $m_1 = \omega(kn^4w^{-3}\log(n/(\delta w)))$ samples to estimate $\hat u$ and $\hat M$ is
sufficient to guarantee that the estimates are accurate. For a well-chosen constant $C$, the condition

\[ J(p) \le Cw^3\log^{-1}\left(\frac{1}{\delta w} + \frac{1}{\eta}\right) \]

of Theorem 2 implies that

\[ \sqrt{\phi} \le \frac{w\sqrt{\epsilon}}{640 \cdot 2}, \]

where

\[ \epsilon = \frac{w}{2^{10}}\log^{-1}\left(\frac{2m_2}{\delta} + \frac{1}{\eta}\right). \]

The arguments of Section 4 then show that the direction $h$ selected in step 3 satisfies

\[ \|P_1^Th\| \ge 1 - \epsilon \ge 1 - \frac{w}{2^{10}}\log^{-1}\left(\frac{m_2}{\delta} + \frac{1}{\eta}\right). \]

Already, for the overlap we have

\[ \sqrt{\phi} \le \frac{w\sqrt{\epsilon}}{640 \cdot 2} \le \sqrt{\frac{w}{2^9}}\log^{-1/2}\left(\frac{m_2}{\delta} + \frac{1}{\eta}\right), \]

so we may apply Lemma 22 with $\delta' = (m_2/\delta + 1/\eta)^{-1}$. Thus, with probability $1 - \delta$ the classifier
$H_{h,t}$ is correct with probability $1 - \delta' \ge 1 - \eta$.

We follow the same outline for $k > 2$, with the quantity $1/\delta' = m_2/\delta + 1/\eta$ being replaced with
$1/\delta' = m/\delta + 1/\eta$, where $m$ is the total number of samples used. This is necessary because the
half-space $H_{h,t}$ must classify every sample point taken below it in the recursion subtree correctly.
This adds the $n$ and $k$ factors so that the required overlap becomes

\[ \phi \le Cw^3k^{-3}\log^{-1}\left(\frac{nk}{\delta w} + \frac{1}{\eta}\right) \]

for an appropriate constant $C$. The correctness in the recursive steps is guaranteed by Lemma 23.
Assuming that all previous steps are correct, the termination condition of step 4 is clearly correct
when a single component is isolated.



6 Conclusion 

We have presented an affine-invariant extension of principal components. We expect that this
technique should be applicable to a broader class of problems. For example, mixtures of distributions
with some mild properties such as center symmetry and some bounds on the first few
moments might be solvable using isotropic PCA. It would be nice to characterize the full scope of
the technique for clustering and also to find other applications, given that standard PCA is widely
used.